Enhancing cancer differentiation with synthetic MRI examinations via generative models: a systematic review

Contemporary deep learning-based decision systems are well-known for requiring high-volume datasets in order to produce generalized, reliable, and high-performing models. However, the collection of such datasets is challenging, requiring time-consuming processes involving also expert clinicians with limited time. In addition, data collection often raises ethical and legal issues and depends on costly and invasive procedures. Deep generative models such as generative adversarial networks and variational autoencoders can capture the underlying distribution of the examined data, allowing them to create new and unique instances of samples. This study aims to shed light on generative data augmentation techniques and corresponding best practices. Through in-depth investigation, we underline the limitations and potential methodology pitfalls from critical standpoint and aim to promote open science research by identifying publicly available open-source repositories and datasets.


Introduction
Deep learning (DL) architectures gained immense popularity in the past few years and specifically when AlexNet [1] has shown outstanding performance and won the ImageNet competition [2] by a large margin compared to the then state-of-the-art machine learning models: however, the history of deep learning began several years ago. Initially, the inspiration emerged from the structure of the human brain. This led to artificial neural networks (ANN) designed for understanding the functionality of the brain [3] and inspired on conceptual level deep learning but not by attempting to match the low-level neural response itself [4]. From the 1940s to 1960s, ANN was known as cybernetics [5] where [6,7] was the pioneers in developing theories of biological learning. Almost one decade after the implementation of the first analog model, the perceptron was introduced by Rosenblatt et al. [8] which could learn the weights for classifying samples based on previously seen examples from each category. The idea of using these networks became irrelevant for the next four decades, as ANN-based models could not successfully perform complex pattern recognition tasks using binary neurons. Nevertheless, the research continued, and when computers started to become fast enough, the idea of back-propagation using continuous neurons to operate floating-point multiplications emerged. Consequently, this led to the training of a neural network with one or two hidden layers [9,10]. During the 1980s-1990s, the name switched to connectionism until 2006 when hardware advancements made feasible the stacking of more nonlinear layers and the term "deep" appeared and prevailed up to this day. Inspiration from neuroscience and specifically by the structure of the mammalian was critical one more time when the neocognitron [11] presented, an innovative and powerful architecture for image processing, which led [12] to the introduction of convolutional networks (ConvNets) featuring supervised learning algorithms such as backpropagation and end-to-end image analysis.
Applications of ConvNets in medical image analysis were first applied but with limited success in the early 1990s by [13,14] mainly for detection of micro-calcifications in digital mammography and detection of lung nodules in chest radiographs [15]. Recently these models have been extended to a wide range of applications such as segmentation of nodules in computerized tomography (CT) images [16], organ segmentation [17], super-resolution [18], denoising [19] and cross-modality synthesis [20]. ConvNets are widely used deep architectures in the current state-of-the-art exploiting properties such as stationarity, locality and compositionality.
A key differentiating factor from other imaging domains is that magnetic resonance imaging (MRI) poses some significant challenges regarding the data collection because of the lack of tissue-specific values, different anatomical areas, varying imaging modalities ( Fig. 1) different scanners and the absence of the imaging standardization across different vendors. The wide range of image acquisition protocols also contributes to the limited stability of deep models and can potentially be an impediment to a robust and generalized decision support system.
The scarcity of large and diverse patient cohorts with high-quality clinical data for specific clinical outcomes has been reported to be the most significant drawback of using DL in medical imaging tasks [21]. Data augmentation significantly enhances the convergence of DL models by synthetically generating new training samples. This can be achieved by incorporating trivial image processing techniques such as deformation, mirroring, flipping, zooming, cropping, rotating and other methods. Fig. 1 The MRI sequences were used in the examined studies. The brain anatomical region included the most studies which justify the reason that T1 contrast-enhanced (T1ce), T2-weighted (T2w), fluid-attenuated inversion recovery (FLAIR) and T1-weighted (T1w) modalities prevail in the above bar chart. Apparent diffusion coefficient, diffusion (ADC), K trans , and dynamic contrast-enhanced (DCE) modalities were examined in the prostate anatomical region. The remained anatomical regions (i.e., pancreas, breast, liver) included as well the first four depicted modalities. Lastly, one study which concerned the brain region examined amide proton transfer weighted (APTw) modality Unsupervised learning techniques such as generative adversarial networks (GANs) [22] and variational autoencoders [23] (VAE) have recently revolutionized data generation. A large number of images can be produced by a deep generative model from a random noise input or a binary segmentation mask. The GANs are usually comprised of two networks the generator G that creates new samples from noise and the discriminator D that distinguishes among the valid and invalid synthetic samples. The convergence of these models is achieved simultaneously by an algorithm that is based on game theory with a minimax loss. VAEs are commonly used for feature extraction by compressing the input to a compact representation, but they have lately been employed for generative properties by manipulating the latent space.
Generation of synthetic data for cross-sectional imaging modalities in oncology is highly challenging since there is a significant underlying biological variability that leads to multiple phenotypic and genetic subtypes of cancerous tumors [24]. Additionally, in this review the focus on MRI was decided because of the unique properties and limitations of this modality, such as the lack of measured signals, dependence on the scanner vendor, acquisition parameters and image acquisition protocols. It is argued that a generalized generative model could potentially overcome some of these drawbacks by enhancing the diversity and size of the examined patient cohort. A few other reviews of GANs for various medical applications, including generating images, have been published. Singh et al. [25] focused mostly on the general technical attributes of GANs. Sorin et al. [26] reported general radiology applications such as reconstruction, denoising, generation, cross-modality translation and segmentation. Yi et al. [27] presented studies with GANs for varying imaging and clinical applications of medical imaging in general. Additionally, a preprint review [28] for GANs that extends the existing reviews by including patient privacy and lesion progression monitoring applications in generative cancer imaging was explored. Notably, Wei et al. [29] reported on VAE-based applications in biomedical informatics, including data generation.
The current literature review presents from a critical standpoint studies that implement novel deep generative data augmentation techniques on cancer MRI examinations, aiming to highlight the best techniques for synthetic tumor representations, identify best practices for data analysis, promote open science approaches based on widely available public data and open-access source code repositories for improved reproducibility. Another key element of this study is to report the most robust evaluation techniques of the generated samples, which include a visual assessment by experts, direct quantitative metrics prior to analysis and indirect methods from the performance enhancements of the downstream analysis.
The sections of the rest of the review are organized as follows: in sect. 2 the search methodology is presented, in sect. 3 key generative architectures used by the examined studies are analyzed, in sect. 4 details about the studies included in this review are reported, in sect. 5 the adopted evaluation methods, limitations and future remarks are discussed, and the final section concludes the review.

Search strategy
A systematic review of studies that use data augmentation techniques that enhance cancer differentiation via deep generative models was performed to assess the impact of GANs and VAE applications in oncology. This systematic review was conducted by following the reporting checklist of the Preferred Reporting Items for Reviews and Meta-Analyses (PRISMA) [30]. For the purpose of this study, a comprehensive literature search was undertaken to identify research papers employing a deep generative model to synthesize cancer MRI images. A protocol was developed in advance to document the analysis method and inclusion criteria. We utilized Scopus, PubMed, Google Scholar, IEEE, Web of Science and Arxiv websites. The papers were selected by querying Scopus and PubMed on peer-reviewed journals and conference/proceeding publications between January 1, 2017, and June 30, 2021. The query contained (("generative" "adversarial" "networks") OR ("GANs") OR ("Variational" "auto" "encoders") OR ("VAE")) AND (("magnetic" "resonance" "imaging") OR ("MRI") ) AND ("data" "augmentation") OR ("oncology" AND "synthe*"). Additionally, on the remaining websites, key words such as "GANs" or "VAE, " "data" "augmentation, " "synthetic" "MRI" "examination" "generation, " "oncology" were used.

Study selection
Two reviewers screened the titles, abstracts and conclusions of the records independently and papers that were clearly not related to the subject matter were discarded. During the first screening phase, the abstracts and conclusions of papers were carefully assessed. Despite the fact that the majority of these papers contained a subset of the searched keywords, a second fulltext screening showed that the methodology presented in many of these works was not relevant to deep learning architectures for data augmentation, and thus they were removed. In total, the current study reviewed 36 papers that were based either on widely used, open datasets or on custom in-house data. Most of the custom datasets had data from patients with neoplasms that had been verified by a biopsy, and one study was comprised of data from patients who had either a biopsy or a follow-up imaging examination within the past twelve months. Notably, these papers included different imaging protocols as depicted in Fig. 1. The study selection process is summarized in Fig. 2, and the publishing timeline of the examined studies is presented in Fig. 3.

Risk of bias assessment
The updated QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies-2) [31] criteria were employed by the two reviewers to evaluate the risk of bias and applicability of the included studies. Each item was rated as "low, " "high" or "unclear. " The item was scored as "unclear" when

Noise to image
Formulating the objective loss function of a GAN model leads the researchers to create a plethora of such models (Fig. 5, Fig. 6) with different kinds of structures and for a variety of objectives. Deep Convolutional Generative Adversarial Networks (DCGAN) (Fig. 5 a) [32] include similar to Vanilla GAN [22] a generator and discriminator and ,nevertheless, instead of a fully connected layer, incorporates a fully convolutional layer which produces improved synthetic images and stabilizes the training process. Likewise, Batch Normalization and LeakyReLU activation function were two important modifications. As an alternative, Wasserstein GAN [33] (WGAN) (Fig. 5  b) replaced the discriminator with a critic where instead of predicting the probability of synthetic images as being real or fake, scores regarding the realness or fakeness of a given image are provided. The generator is trained by minimization of the distance between the distribution of real and generated examples. Progressive Growing of GANs (PGGANs) (Fig. 5 c) introduced by Karras et al. [34] to improve quality, stability and variation. The rationale of this approximation is to progressively increase the generator and discriminator, which starts from a low resolution and adds new layers, whereas the training progresses. Variational autoencoders (VAEs) (Fig. 5 d) [23] are another variant of autoencoder where the network through mapping the input into distribution in the latent space with the encoder network, enables the ability to sample first the latent vector from the distribution and then using the decoder to generate new data.

Image to image
The pix2pix GAN [35] (Fig. 6 a) is a supervised imageto-image model, where a target image is synthesized conditional on a given input image. The cyclic adversarial generative network (CycleGAN) (Fig. 6 b) [36] is composed of two generators and two discriminators to perform higher-resolution image-to-image translation using unpaired data. Multimodal Unsupervised Imageto-Image Translation [37] (MUNIT) (Fig. 6 c) architecture trains two auto-encoders, one to encode the content Fig. 3 The graph depicts the gradual increase of studies on data augmentation with synthetic MRI examinations to enhance cancer differentiation. In the year 2020, the number of studies exceeded the total number of the previous three years indicating the interest of researchers in the field Fig. 4 Summary results of QUADAS-2 tool on risk of bias and applicability concerns for the included studies in the present systematic review and the other to encode the style of images. Furthermore, the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. The architecture recombines the content code with a random style code sampled from the style space of the target domain, to translate an image to another domain. Figure 7 summarizes the previously mentioned architectures proposed in the examined studies. The category of "Translation architectures" includes mainly the pix2pix, CycleGAN and MUNIT architectures, whereas the "hybrid architectures" are when GANs and VAEs are combined.

Brain
Beers et al. [38] were the pioneers who employed PGGANs on retinal fundus images and brain tumor multi-modal MRI images. The generative network was able to synthesize high-resolution medical images that were both realistic and phenotypically diverse. The authors concatenated the segmentation glioma maps along with multi-modal MRI images as color channels, to allow the network to synthesize anatomically correct tumor structures in the synthetic MRI slices. The primary idea of the DCGAN compared to vanilla GAN is that adds transposed convolutional layers between the input vector Z and the output image in the generator. In addition, the discriminator incorporates convolutional layers to classify the generated and real images with the corresponding label real or synthetic. b Training a GAN is not trivial. Such models may never converge and issues such as model collapses and vanishing of gradients are common. WGAN proposes a new cost function using Wasserstein distance that has a smoother gradient. The discriminator is referred to as the critic who returns a value in a range, instead of 0 or 1, and therefore acts less strictly. c The training in PGGAN starts with a single convolution block in both generator and discriminator leading to 4 x 4 synthetic images. Real images are downsampled also to be of size 4 x 4. After a few iterations, another layer of convolution is introduced in both networks until desired resolution (e.g., 256 x 256 in the schematic). By progressively growing the network learns high-level structures first followed by finer-scale details available at higher resolutions. d In contrast to traditional autoencoders, VAE is both probabilistic and generative. The encoder learns the mean codings, µ , and standard deviation codings, σ . Therefore the model is capable of randomly sample from a Gaussian distribution and generating the latent variables Z. These latent variables are then "decoded" to reconstruct the input Han et al. [39] proposed a methodology for addressing the tumor diversity in generated medical images with realistic morphological characteristics. Thus, they compared two generative architectures, DCGAN and WGAN for the sake of this objective. The latter architecture produced synthetic images with increased tumor diversity in multi-sequence MRI, and it was successfully captured the sequence-specific texture and tumor appearance (Fig. 8  a). In addition, an expert physician evaluated via Visual Turing Test [40] all the generated images derived from WGAN as more realistic. Employing different types of generative models such as PGGAN and with the aim to synthesize a single image type (Fig. 8 b) Han et al. [41] adopted these multi-stage generative architectures to augment the original dataset and perform a downstream task. In another work, Han et al. [42] implemented a noise-to-image and image-to-image adversarial architecture ( Fig. 8 c1) for sample generation (Fig. 8 c2) to increase the performance of the classification in terms of accuracy, sensitivity and specificity. Additionally, the proposed pipeline was divided into a two-stage approach to improve model convergence. Initially, the PGGAN generates the high-resolution MRI images and two translation frameworks MUNIT [37] and SimGAN [43] were used consecutively to refine the synthetic images and increase their realism, and anatomical diversity. The combination of traditional data augmented techniques with the refined images increased the performance of Generative architectures for image-to-image translation. a The pix2pix is an extension of the conditional GAN architecture that provides control over the generated image. The U-net model generator translates images from one domain to another and through skip connections the low-level features are shared. The discriminator judges whether a patch of an image is real or synthetic instead of judging the whole image, while the modified loss function allows the generated image to be plausible in the content of the target domain. b CycleGAN is designed specifically to perform image-to-image translation on unpaired sets of images. The architecture uses two generators and two discriminators. The two generators are often variations of autoencoders where they take as input an image and output an image as output; the discriminator, however, takes as input an image and outputs one single value. In the case of CycleGAN, a generator gets further feedback from the other generator. This feedback confirms whether an image generated by a generator is cycle consistent, meaning that applying successively both generators on an image should produce a similar image. c In the MUNIT architecture, the image representation is decomposed into a content code and style code through the respective encoders. The content code and style code is recombined to translate an image to the target domain. By sampling different style codes the model is capable of producing diverse and multimodal outputs the downstream task and improved tumor detection. The Conditional Progressive Growing of GANs (CPGGAN) [44] is an expansion of a previous architecture [41] and is conditioned to generate MRI images with brain metastases at specific positions and sizes (Fig. 8 d) since brain metastases are the most common intracranial tumors, getting prevalent as oncological treatments ameliorate cancer patients' survival [45].
Shin et al. [46] used the pix2pix GAN model to generate synthetic MRI images with brain tumors. At first, the model was trained to segment normal brain anatomy from the T1-weighted images of the ADNI dataset. The same model is used again on different MRI sequences of the BRATS dataset. Combining brain anatomy and tumor segmentation, the overall segmentation of the brain with tumor is acquired. The authors used the segmentation masks and the BRATS dataset to generate the synthetic brain MRIs with lesions in four different sequences. Furthermore, by adjusting these masks (e.g., shifting tumor size, changing tumor location or locating tumor on an otherwise tumor-free brain label), they introduced variability in the anatomical regions and tumor characteristics. Lastly, when the original dataset augmented with synthetic images. the results of the downstream task revealed a performance that outperformed compared to the model trained with only real data.
The Asynchronized Discriminator GAN (AsynD-GAN [47,48]) is a distributed learning framework that is comprised of a 9-block ResNet auto-encoder (generator) and various PatchGANs [35] (multiple discriminators) that can capture the localized anatomy of real and synthetic data ( Fig. 9 a). The centralized generator learns the joint distribution of multiple data from different institutions where for each institution exist a discriminator to classify the local real data and the synthetic data. Thus, the framework ensures that the generated images can be shared across multiple institutes with no privacy concerns and promote collaborative research. The performance of a downstream task such as brain tumor segmentation, suggests that models trained with ASynD-GAN synthetic data achieved close-to-real performance when compared to models trained entirely on real data.
Medical imaging datasets are frequently limited in terms of size and diversity, especially in oncology since the natural prevalence of the disease leads to imbalanced sets. Deepak et al. [49] applied a multi-scale gradient GAN (MSG-GAN) due to simplicity and robustness compared to DCGAN and PGGAN, to generate meningioma tumor MRI samples in coronal plane. The progressively growing generator accomplished to augment the class imbalanced dataset by 55 samples, improving the balanced accuracy score up-to 93%. Likewise, Qasim et al. [50] modified SPADE-GAN [51] and proposed Red-GAN, by introducing an adversarial pipeline conditioned on both local and global information. The 7-SPADE ResNet as a generator synthesized MRI images across multiple modalities from lesion masks ( Fig. 9 b) and then together with real images was separately given as input to the U-Net segmentation model to ensure that synthetic and real images are in close proximity in the latent representation. The feature representations were obtained from the U-Net along with lesion masks, and the corresponding synthetic or real slices were fed as input to the PatchGAN discriminator. Dice performance when the downstream model trained only on synthetic images reached 0.659, while an increase up to 5% was achieved for most of the sparse classes when the original dataset was augmented with synthetic samples.
VAE is stable during the training process, but blurry images can be produced. On the other hand, GANs can synthesize realistic images, but they are unstable during training. Kwon et al. [52] proposed a 3D-GAN model that leverages the α-GAN [53] architecture which essentially combines the advantages of the aforementioned networks with an additional auto-encoder and a code discriminator on the top of the existing generator and discriminator. Moreover, the Wasserstein distance with gradient penalty was introduced to reduce the training instability. The authors claim that the model generates realistic samples with brain tumor lesions at various positions while properly reacting the characteristics of different modalities ( Fig. 9 d). By using principal component analysis (PCA) cluster representation, it was demonstrated that a moderate larger latent noise vector assists the model to escape from model collapse. The combination of auto-encoders and GANs [54] was proposed as a hybrid GAN framework to improve diversity in local areas of MRI images and augment the available samples in the examined dataset. Initially, real MRI images are divided into equal-sized patches and fed as input to the encoder-decoder module where synthetic patches can be sampled. Next, the generated patches along with a constrained noise vector are set into RU-NET generator, where finally fake patches are integrated into full-sized synthetic images. Notably, binary classification (i.e., tumor, non-tumor) performance decreased when combining synthetic with real data, while the accuracy reached at highest when the model trained only on synthetic data. Pesteie et al. [55] proposed a conditional VAE fitted with a novel adaptive training algorithm. This technique was applied in ultrasound and MRI data for sample generation in segmentation tasks. The examinations, annotations and latent variables were kept, independent. The model learns to synthesize data from joint distribution composed of a random latent sample and an encoded segmentation mask. Additionally, two augmentation techniques were employed with a static and a trainable adaptive parameter for deforming the input segmentation mask.
Lesion segmentation in medical images is a challenging task and can be achieved by automated or semi-automated detection of lesions or organs within 2D or 3D examinations. The high variability of tumors in terms of shape and texture is the major challenge for segmentation tasks. Generative models are suitable for diversifying limited datasets with new samples. In particular, to address the issue of overlapping pixel intensities of regions of  [39]. b Example of T1 contrast-enhanced synthetic tumor and normal examinations in both successful and failed cases [41]. c1 The proposed noise-to-image and image-to-image combined architectures for tumor detection [42]. c2 Example of T1 contrast-enhanced synthetic tumor and normal examinations in both successful and failed cases [42]. d Synthetic T1 contrast-enhanced samples with the tumor bounding boxes [44]. By "non-tumor" areas the authors refer to "normal examinations" interest (ROI) with other tissue types in brain MRI sequences which can make challenging the automatic pixelwise segmentation, Hamghalam et al. [56] proposed the enhancement and segmentation GAN (Enh-Seg-GAN) with the aim to generate enhanced patches, with no substantial class overlap ( Fig. 9 c). The synthetically enhanced patches were derived from adaptive recalibration, encoder-decoder block (i.e., generator), and then identified from a Markovian discriminator.
Qi et al. [57] highlight and address the limitations of generating brain MRI images with tumor characteristics in previous studies [44,46]. Specifically, the quality of the generated tumor masks is low and the actual position of the tumor compared to its mask has to be redefined manually. This can lead to changes in the image prior leading to an increase of false positives per slice. Moreover, the adversarial loss all alone is not adequate to synthesize realistic tumor images from normal MRI images [44]. Driven by the success of cycleGAN, SAG-GAN [57] was introduced to overcome these drawbacks. Its architecture includes two generators, each of which maps: (a) normal images to tumor images; (b) tumor images to normal images. The authors incorporated the idea of a semi-supervised attention mechanism into the Key generated samples for the examined studies. a The input of the AsynDGAN network, the generated sample and the corresponding real image [47,48]. b Generated images conditioned on lesion masks [50]. c An example of generated images with the corresponding segmentation and groundtruth. The colors mean yellow: edema, blue: non-enhancing, and green: enhancing tumor. The 3D representation of the tumor is presented on the top right [56]. d Synthetic samples of severe cases of brain tumor, for better visualization the authors displayed color-mapped images where yellow indicates higher and blue indicates lower intensity [52] generative network. Specifically, adding attention in the channel module allows the model to focus on channels with informative features and suppress the less useful information. Furthermore, in the architecture of the generators, an attention network aims to select the area to generate tumor and to locate the place with the tumor, leading to the generation of the probability map. On the other hand, an attention mechanism is also included in the discriminators to emphasize only the regions inside the attention map.
Guo et al. [58] proposed a SAMR framework to synthesize meaningful high-quality sequences of anatomic and molecular MRI images from arbitrary manipulated lesion information. The generator is comprised of four components; (a) a down-sampling module where the lesion segmentation maps (i.e., background, norm1al brain, edema, cavity caused by surgery and tumor) are given as input to get a latent feature map; (b) an atlas encoder that takes the analogous multi-model atlas of size 256 x 256 x 15 to get another latent feature map; (c) a set of residual blocks where the concatenation of the two latent maps is given as input to learn better transformation functions and representations; and (d) the stretch-out up-sampling module where the synthesis of MRI slices of size 256 x 256 x 5, takes place. In the other part of the adversarial learning, multi-scale PatchGAN discriminators were adopted. An expert neuroradiologist verified the pathological information of the synthesized images. Additionally, quantitative results on the external datasets (i.e., BraTS 2018) showed the superiority of the proposed method compared to other architectures [35,46,59]. MRI synthesis is a challenging task since radiographic features vary and pathological information includes high-frequency components. Thus, special attention is required to deal with the uncertainty [60]. To achieve this, the authors extended their previous work [58] and proposed the Confidence-Guided SAMR (CG-SAMR) [60] incorporating two crucial modules. In particular, the generator comprises of two components, an encoder and decoder with stretch-out up-sampling block. The latter component includes a synthesis module and a confidence map module. The rationale behind this is, instead of directly synthesizing MRI images from input, to initially estimate the intermediate synthesis results at half scale size, and simultaneously the corresponding confidence map is calculated by the loss function. This gives attention to uncertain regions and prevents the propagation of incorrect estimation, and therefore the synthesis of the final output is created. The discriminator components (i.e., multi-scale labelwise discriminators) remained the same as in their primary work. In addition, the proposed architecture is extended to be trained in an unsupervised fashion without the necessity of paired data (UCG-SAMR).
Quantitative results reported an improvement compared with the previous method [58] and other existing architectures [35,59]. Likewise, UCG-SAMR outperformed on pixel accuracy, SSIM and PSNR metrics; however, the network achieved the second-best performance in terms of dice score against other models [61][62][63].
Isocitrate dehydrogenase 1 (IDH1) mutation information is crucial for diagnosis, prognosis and guidance in clinical decisions due to observation that IDH1 mutated gliomas have an improved overall survival rate rather than with IDH1 wild type [64,65]. However, this is molecular-level information which makes the identification a challenging task for machine learning methods. Ge et al. [66] proposed a workflow consisting of three modules to improve glioma subtype classification. The Pairwise GAN which is essentially a bi-directional crossmodality model was trained to augment data across multiple domains. Two use cases were conducted to analyze the usage of synthetic data. In the first one, the original dataset was enlarged with synthetic and in all four modalities and an increment of 2.94% and 12.73% in classification accuracy and sensitivity, respectively, was reported. However, a 1.74% decrease in specificity was reported indicating a slightly more increase of false positives. In the second, the training dataset was augmented both with synthetic images and missing scans from various modalities and an improvement on the aforementioned metrics was observed except specificity that remained same as in the baseline method. Most notably, the overall performance of the downstream task has been significantly improved by the inclusion of lesion masks in the proposed analysis.
A large number of examinations in similar datasets are not annotated on a pixel-basis, which makes training supervised DL models difficult. In [67] the authors proposed an extension to their previous adversarial architecture [66] where initially a multi-stream 2D CNN is trained to extract features from a sparsely labeled dataset followed by a graph-based semi-supervised model to assign labels to unlabeled examinations. Thus, data from unlabeled and labeled sets are fed into this pairwise GAN to enrich the original datasets. Then, during the testing phase, GAN-based data along with real labeled and unlabeled were used by multi-stream 2D CNN to learn gliomas-related imaging features. Finally, a higher-level classification module was integrated into the adversarial architecture to predict the tumor molecular subtypes.
Carver et al. [68] proposed a methodology for increasing tumor variability in terms of size, shape and location in synthetic multi-parametric MRI images, in addition to investigating how different subsets of generated images could affect segmentation performance. This architecture was trained in a supervised manner since the generator requires pixelwise annotations which in turn were derived from real MRI images. Initially, the first discriminator, a pre-trained VGG-19, was used to calculate the perceptual and per-pixel loss, whereas a second discriminator (PatchGAN) was utilized to penalize on an image patches-basis. The authors performed also qualitative analysis to further examine both the overall and inter-modality quality of synthetic images. An expert physician assessed the generated images through the Visual Turing Test and performed an in-depth analysis of pairs of synthetic and real images. It was noted that synthetic images displayed high quality with plainly defined structural boundaries; however, they were lacking in presenting the details related to edema. Nevertheless, the overall segmentation performance increased by 4.8% on the average dice similarity coefficient (DSC).
Mok et al. [69] proposed a coarse-to-fine boundaryaware generative adversarial network (CB-GAN) to synthesize high-resolution multimodal MRI images of high-grade and low-grade glioma patients. In particular, this architecture consists of two generators and four discriminators. Primarily, instead of feeding as input to the network a noise vector from a normal distribution, the authors replaced it with a semantic segmentation mask as a condition variable for introducing diversity to the generated images with different tumor shapes and preventing mode collapse. Consequently, the coarse generator is bounded to generate the primordial shape and texture of synthetic images, while a multi-task generator aims to preserve tumor boundaries by incorporating a desired invariance and robustness to the network. Multiple discriminators with different scales of input were adopted to capture both global and local information. The proposed pipeline improved the performance over traditional data augmentation methods on average by 3.5% regarding dice score and furthermore outperformed other state-ofthe-art methods [70,71] for the enhancing tumor task in terms of dice precision.
Dikici et al. [72] proposed the constrained generative adversarial network ensembles (cGANe), which is essentially an aggregate of DCGANs. The selection of which DCGAN will pass into cGANe framework was based on FD [73] computed score which had to be lower than a predefined threshold value ω . The population of cGANe was determined by a brain metastasis detection algorithm [74], which was evaluated by calculating the average number of false detections per patient. To eliminate the generation of a synthetic data sample that resembles an original data sample from the training set which is crucial in terms of anonymity, the mutual information metric was used. T-SNE cluster representation visualization revealed that cGANe40 generated convincing synthetic images indistinguishable from real samples.
Kamli et al. [75] incorporated in the proposed adversarial pipeline an anonymizing model [46], to increase prediction performance in patients with glioblastoma multiforme tumor growth. The authors did not modify the tumor size and shape as in [46], because any modifications on these parameters could affect the accurate prediction of tumor growth. The Synthetic Medical Image Generator (SMIG) was trained to generate tumors in varying locations such as healthy brain regions. The authors experimentally showed that the augmentation of the dataset by up-to 80% of the samples being synthetic and 20% real data improved the segmentation performance. For this purpose, a fully automatic brain Tumor Growth Predictor (TGP) model which is based on a convolutional auto-encoder model [76] was integrated into the pipeline. Furthermore, it should be noted that preprocessing steps had a positive effect on the quantitative metrics.
Pseudoprogression (PsP) and true tumor progression (TTP) in glioblastoma multiforme (GBM) can occur after standard treatment. The distinction between them is mainly based on MRI analysis of the lesion area, which is a time-consuming procedure for clinicians. Li et al. [77] merged DCGAN and AlexNet and proposed DC-AL GAN that trained in an adversarial way on longitudinal diffusion tensor imaging (DTI) data to discriminate PsP and TTP in MRI images. In particular, the generator aims to create fake pair samples of 512 by 512 pixels in size that are similar to original data, while a modified AlexNet is placed in the role of the discriminator to extract highlevel refined features. The proposed discriminator incorporates a multi-feature selection module to concatenate deep coarse features with shallow fine features. Finally, the aforementioned features were flattened and employed as input to an SVM classifier.
The number of patients, lesion types, modalities, preprocessing and downstream tasks are demonstrated in Table 1. The generative methodologies, including the deep generative architectures, and hyper-parameters are shown in details in Table 2. An evaluation comparison among the examined studies is presented in Table 3.

Prostate
DCGANs and cGANs are novel variants of generative models published after their first appearance [22]. Kitchen et al. [78] employed DCGANs in 2017 to synthesize 16 by 16 prostate MRI patches across three modalities such as T2, ADC, K trans . One year later Hu et al. [79] employed cGANs that take Gleason scores as a condition in the training process with the aim to synthesize focal prostate diffusion images of size 32 by 32.
ADC values derived from diffusion-weighted MRI are useful non-invasive biomarkers for accurately assessing  [80]. However, such data are often scarce and thus it is limiting the usage of DL architectures. Wang et al. [81] present a stitch AD-GAN to synthesize the ADC data at a size of 64 by 64. Initially, the target image space was divided into subspaces each of which had a lower scale in order to reduce the complexity of the data manifold. Thus, instead of directly synthesizing the image in the target space that would affect the quality of data, it is generated four 32 by 32 sub-images in the divided subspaces. Then, a nonparametric Stitch layer was employed to interlace these sub-images into the fullsize target image. Additionally, the discriminative module consists of two critic networks: (a) the first minimize Wasserstein distance between synthetic and real CS PCa data; and (b) the second maximize the auxiliary distance JSD among synthetic CS PCa and non-CS PCa. By incorporating a StitchLayer and the aforementioned loss functions the network was capable of synthesizing full images with global structure and precise local information. Yang et al. [82] presented a novel bi-parametric (i.e., ADC-T2w) image generation network based on semisupervised learning. In particular, the proposed framework consists of two generative modules that synthesize corresponding images in the two modalities in sequential order. The sequential synthesis mainly consists of three modules (a) a pre-trained on ImageNet [2] Inception-V3 network that extracts hierarchical features of both real and fake images to measure the complexity of two modalities; the modality with the lower complexity is synthesized first (b) an encoder which maps a real image of each modality to a low-dimensional latent vector and (c) a synthesizer which first decodes the latent vector to a fake image and then translates it to another modality. Semi-supervised sequential synthesis of bi-modality images achieved superior performance in all evaluation metrics and classification accuracy when compared with supervised sequential GANs, unsupervised sequential GANs and parallel GANs. Wang et al. [83] combined these studies [81,82] and proposed an improved semisupervised architecture to synthesize mp-MRI data with sufficient diversity to include meaningful CS PCa from a small amount of training data. Initially, a decoder derives low-dimensional ADC maps from 128-d latent vectors, then a StitchLayer converts the low-dimensional ADC maps to a full-size ADC image, and finally, a U-Net is used as an image translator to convert ADC image to a paired T2w. Complexity measurer was excluded as ADC maps are proven to be easier to synthesize first due to low spatial resolution.
Fernandez-Quilez et al. [84] proposed a semiautomatic pipeline with two generative models in a sequential fashion to generate synthetic pairs of T2-weighted prostate MRI and their corresponding whole gland mask. Initially, a DCGAN model was trained on PROMISE 12 dataset to synthesize whole prostate gland masks. The selection criteria for synthetic masks are based on the visual appearance and done manually. Afterward, a pix2pix architecture converted the synthetic whole gland masks into T2-weighted modality leading up to 10 thousand synthetic paired data. For evaluation a segmentation task was performed, where a U-Net architecture trained with multiple data including real data, synthetic, classical augmentation and combinations of them.
Taking into consideration the DCGANs drawbacks, Yu et al. [85] proposed CapGAN which features two major modifications. First, the capsule network replaced the CNN as a discriminator to better achieve an equivariant representation of images that is robust to the changes in the pose and spatial relationship of objects in the images. Second, the least-squares loss was adopted for both generator and discriminator to address the vanishing LGG, low-grade gliomas; HGG, high-grade gliomas; T1ce, T1 contrast-enhanced; FLAIR fluid-attenuated inversion recovery; T1adjacent w, T1-weighted; T2w, T2-weighted; GBM, glioblastoma multiforme; Gd-T1w, gadolinium-enhanced; APTw, amide proton transfer weighted; IDH, isocitrate dehydrogenase; PSP, pseudoprogression; TTP, true tumor progression; DTI, diffusion tensor imaging, symbol '-' represents that the corresponding information was not provided in the publication  gradient problem of the sigmoid cross-entropy loss function and simultaneously generate higher-quality images. Furthermore, to evaluate the synthetic images both qualitative and quantitative metrics were introduced. Qualitative evaluation was conducted through visual inspection by two experts, while quantitative evaluation was performed in terms of KL divergence by calculating the probability distribution of synthetic and real images based on a pre-trained SVM classifier. The downstream task such as classification was applied via a combination of LENet [86] and NIN [87] architectures. The whole framework was applied both to prostate and simulated brain MRI images. To address the cross-client variation problem in medical image data, Yan et al. [88] proposed a variation-aware federated learning (VALF) framework. Three different GAN architectures were employed to address privacy concerns. First, the data complexity of all clients was calculated to define the target image space. In detail, a WGAN with gradient penalty was trained to generate synthetic data and PCA with t-SNE was applied to extract discriminative imaging features. The complexity score for each client was defined by calculating L2 distance between the features of generated and original data. The client with the lowest complexity was selected as the target image space. Afterward, to share the defined image space with other clients, a collection of images is synthesized via a privacy-preserving-adversarial network (PPWGAN-GP), and a subset of them which can effectively capture the characteristics of the raw images but without leaking the privacy is selected. Finally, by a modified cycle-GAN, the corresponding raw images of each client are converted into target image space defined by the shared synthetic images.

Pancreas
Gao et al. [89] employed a GAN in a recent study [32] to improve a pancreatic neuroendocrine tumor (pNET) differentiation model. The generator incorporated a fully connected layer followed by several fractionally strided convolutional layers with a kernel of 4 by 4 pixels. The input vector of 100 noise values was selected from a normal distribution to be transformed by the generator into synthetic T1ce patches of 56 by 56 pixels. The authors emphasize the value of the generated images, which can assist with the difficult task of gathering radiological examinations of rare diseases. The proposed analysis, which incorporates a generative model and deep learning classification, demonstrated the capacity to discriminate among the World Health Organization (WHO) grades of pNET on T1 contrast-enhanced MRI. The generated patches were evaluated by two radiologists to ensure the quality of the training samples. In another work, [90] a DCGAN was used to increase the extracted ten thousand patches of regions of interest up to twenty-five thousand training samples for pancreatic disease classification. The original dataset was comprised of 448 patients with T1-weighted dynamic contrast-enhanced MRI examinations and an external validation set of 56 patients. A total of twenty-three diseases were grouped into seven classes.
Since the largest out of seven groups of patches was the carcinomas, it was left out of the generation process to improve the imbalanced dataset. Consequently, six generative models were trained on the remaining groups, independently for each class. This resulted in a balanced training set across all seven groups. One radiologist examined the generated images to ensure the validity of the process.

Breast
Generative models have been implemented for synthesizing multi-parametric breast MRI image patches by Haarburger et al. [91]. Two custom GAN architectures were employed to generate sequences of T2 and T1-DCE image patches that were conditioned on healthy tissue, benign and malignant lesions. Initially, an experienced radiologist segmented manually all suspicious and non-lesion areas on every slice, leading to a dataset of 401,525 patches in total. DCGAN and WGAN were trained to synthesize patches 64 by 64. The generated samples were evaluated qualitatively by expert clinicians as well as quantitatively. Despite most of the patches being morphologically realistic, it was also observed that synthesized images included fat-shift artifacts. In terms of quantitative metrics, DCGANs were superior to WGANs, but qualitatively minor differences were observed.

Liver
Sun et al. [92] implemented a manifold matching generative adversarial network (MM-GAN) for augmenting the training set up to 500%. The effect of different levels of synthetic data ranging from 0 to 500% was also tested in segmentation tasks on two datasets: BRATS17 and LIVER100. In particular, the performance in terms of DSC of the glioma segmentation (T1ce) was improved  for the whole tumor by 0.17 and the tumor core by 0.16 on the unseen testing set. Additionally, only a small fraction of the original dataset (29 samples) was used for fine-tuning the model that was trained exclusively on synthetic data without compromising the segmentation performance. The synthetic data were assessed visually by observing the brain structure and the retained details (namely the cerebrum, cerebellum, diencephalon, brainstem, sulci, gyri). A key aspect of the synthetic images was that the brain anatomy was adapted with respect to the given segmentation mask while still displaying a noticeable difference in appearance between real MRI images.
Details about the examined datasets, pre-processing methods, deep generative model architectures, and a comparative analysis are shown in Tables 4, 5 and 6.

Discussion
Deep learning applications regarding medical image analysis applications require big databases from different acquisition protocols in order to effectively capture the intra-class variability in lesion types. This is especially relevant in oncology, where in many cases the natural prevalence of the disease, anatomical variability and tumor heterogeneity cannot be modeled due to the limited available data. Deep generative models can potentially Table 4 Details of the datasets and data processing methods used in the examined studies

Synthetic data evaluation
Three types of synthetic data evaluation and their combinations were followed in the examined studies: a) direct by specific image quality metrics (MSE, MAE SSIM, etc.): b) indirect by examining the delta of the performance prior and after samples generation: and c) qualitative by either expert clinicians or by plotting cluster distribution of the generated samples. A direct assessment of the generative task using statistical metrics was performed by twelve studies [52,54,56,60,66,68,72,[82][83][84][85]91] prior to downstream model training in order to verify the validity and quality of the generated samples. These types of metrics provide an objective and quantitative process of assessing the generative models, allowing comparison among different methodologies. On the contrary, the majority of the examined studies, as illustrated in Fig. 10, incorporate an indirect evaluation of the generated images. Additionally, seventeen [38, 46-49, 54-57, 60, 67, 69, 75, 78, 79, 81, 92] of these studies visualize a selection of the synthetic samples, and thirteen [39,41,42,44,58,66,68,82,83,85,[89][90][91] were qualitatively evaluated by experienced clinicians, as illustrated in Fig. 11. However, due to the large amount of synthetic data generated, this sort of qualitative evaluation is prone to errors and inter-observer variability [90]. Furthermore, some studies that employ expert clinicians to assess the generated samples report large variability in the scores [82,83]. A significant number (nineteen) [38, 39, 44, 46, 49, 50, 52, 54, 56, 60, 72, 77-79, 81, 89-92] of those papers did not provide sufficient statistical metrics (four of them none) [38,39,78,79] for assessing the impact of synthetic data on model performance. This is likely to have led to the insufficient evaluation of the generalization status in the examined downstream tasks with reduced performance in the unseen data. Because deep learning-based techniques are known to be prone to overfitting and noisy information memorization, this is a significant disadvantage for model trustworthiness and robustness of the generative models.

Limitations of this review
This study has some limitations. It was particularly difficult to identify studies that incorporated GANs or VAE for data augmentation since, in many papers, it was merely a component of their overall study pipeline or was only referenced briefly in a paragraph with little information. Furthermore, in the majority of studies, the authors did not provide the necessary information (analysis protocol, hyperparameters, measurements, etc.) that would allow us to fully assess the quality of each study and objectively evaluate their findings. As a result, most of these experiments are impossible to replicate. The original source code or custom datasets are not publicly available for the examined manuscripts, but only in a small number of studies, making extraction of the required hyperparameter problematic.

Limitations of the reviewed papers
There are different strategies to mitigate the limited population of the original dataset, including subsampling of examinations at a slice level (from 3D volume to 2D slices, twenty-six studies) [38, 39, 41, 42, 44, 47-50, 54, 55, 57, 58, 60, 66-69, 75, 77, 79, 81-84, 88] and at a patch level (from tumor to sub-regions of the tumor, seven studies) [56,72,78,85,[89][90][91]. These subsampling techniques can result in the loss of key features of tumor heterogeneity, significant voxel-based and spatial information with morphological features such as sphericity, shape and volume. This is likely to negatively impact the generalization ability of both the generative model and decision support systems. The inherent heterogeneity of cancer images emanating from specific genetic traits, local mutational diversity, varying shape attributes, unclear boundaries, multiple subtypes and stages can significantly affect clinical outcomes. Because of all these parameters, capturing discriminative imaging markers for assessing the output variables can become challenging when generating images that include highly heterogeneous regions of interest.
Random noise vectors have been utilized as input for data generation in twenty-one [38, 39, 41, 42, 44, 49, 52, 54, 72, 77-79, 81-85, 88-91] and pixel-level lesion annotations in twelve studies [46-48, 50, 58, 60, 66-69, 75, 92]. Although this ensures the generation of different types of tissue, including tumor regions, it may lead to less variety in terms of shape and volume of the examined anatomical regions. A potential solution to this issue was proposed by Pesteie et al. [55] in which random deformation of the semantic segmentation mask was performed prior to generating new samples.
The reproducibility of deep learning models in medical imaging is a major challenge since decision support systems must adhere to the relevant legislation and be licensed by the respective regulatory bodies. Many studies, as evident by tables 1, 2 and 4, 5 , provide incomplete experimental protocol information with missing key parameters and data processing details. In addition, many studies are based on proprietary datasets, making comparisons with similar approaches challenging. Open datasets and publicly available source code repositories could, to an extent, address these issues and accelerate progress with respect to the current state-of-the-art methods. The open-source code from reviewed papers on the GitHub repository is presented in Table 7. Evaluation methods for assessing the generative process: an indirect metric for the downstream task (i.e., classification, segmentation, detection, etc.) where is calculated the performance prior to and after sample generation. Qualitative analysis is where expert clinicians assess the generated images with statistical methods or the via Visual Turing Test. Direct assessment of generated samples with image quality metrics (e.g., MSE, FID, IS, etc.), and studies without any metric Fig. 11 Qualitative methods were used in the examined studies for evaluating the generated samples. Almost half of the examined studies evaluated the synthetic images by visualization, whereas 36.1% employed expert clinicians to assess the generated samples using statistical methods or an operator-assisted device that produces a stochastic sequence of binary questions from a given test image (i.e., Visual Turing Test). The 11.1% used cluster visualization methods such as PCA and t-SNE, and a small percentage (5.5%) did not use any qualitative method to assess the synthetic images Class imbalance on a patient-basis regarding the examined disease can potentially result in lower performance in the minority class in both the generative model and downstream tasks [67,90]. Consequently, this could likely compromise the diagnostic value of the deep model. Stemming from the limited patient data in most datasets, the lack of anatomical diversity is a critical issue in oncological imaging since the available tumor pixels are far less than the other types of tissues in the examined volume of interest [44].
A trade-off between signal quality and noise in the generated MRI examinations is a key element during the convergence of generative models to capture the granular imaging patterns in each class distribution. Constraints during generation might be implemented to ensure that the intensities of pixels are uniform and realistic [93]. There are also concerns regarding the image quality of the generated samples (low-resolution, distortion, blurriness, etc.) [90,94,95]. Additionally, variations [60] in spatial resolution and pixel coordinates among MRI examinations may compromise the generalizability of the analysis when raw data are used. Thus, resampling to harmonize spacing is a necessary pre-processing task for any (2D or 3D) convolutional deep model, but this can also substantially affect the underlying hidden tumor patterns in MRI images [90].
Deep models trained on a single medical center might capture biased distributions. Thus, external validation sets are of paramount importance for generalization [58,89,90]. Additionally, evaluating generated images by expert radiologists cannot be always considered a feasible option due to their limited available time, the highdimensionality of MRI images and the subjective nature of the tasks often requiring multiple clinicians in order to minimize inter-observer variability. Additionally, as is evident in Table 6, the difference in qualitative scoring for the generated samples by expert clinicians can be substantial.

Advancing generative models for radiology applications
Novel quantitative and qualitative methods should be developed [96] to provide insights not only about how realistic a generated image is but also to ensure that regions of interest are anatomically correct [93] and a true representation of MRI scans. Additionally, generating 3D MRI volumes instead of 2D slices [46,52,92] can significantly improve the convergence of the downstream tasks since imaging features based on three-dimensional raw data can increase robustness and generalizability. Additionally, key advancements in generative models include the improvement in the fidelity [60,77,93,97] and reduction of the smoothed patterns [93] of MRI images via denoising or other voxel-based techniques.
Ge et al. [67] suggest that GANs can be extended to capture rare genetic alterations that have a significant impact on assessing the response to targeted treatments. Wang et al. [81] employed a stitch layer in the generator to address the difficult-to-optimize problem in most GANs for high-dimensional image synthesis in prostate MRI. Different techniques have been introduced to improve image quality [91] and enhance deep generative model convergence [42,49,[57][58][59][60]68].
GANs trained on multicentric MRI data can benefit from scanner variability and further improve generalization of the targeted task [60]. Accordingly, enhancing the render process of MRI itself by utilizing the raw k-space data [98] will advance the current acquisition and reconstruction process, enabling a more optimized quantitative analysis.
Regional legislative frameworks for privacy are posing significant challenges in medical data analysis. Deep generative models can assist in providing full anonymity of medical data [44, 46-48, 72, 75, 88], even on an image level, making it easier to share them specifically the synthetic version of them.

Research challenges and future directions
Despite the active research of generative models on MRI image analysis, non-trivial challenges still remain. Future studies should examine a more diverse patient cohort to capture the tumor variability in terms of shape, location, anatomical region, genetic background, histological subtypes and other clinically significant parameters.
In particular, 35% of the examined studies, as illustrated in Fig. 4, were either unclear about their patient stratification protocol or had a high risk of introducing selection bias like focusing only on large lesions such as high-grade glioma. Data augmentation for rare tumors in anatomical locations such as the pancreas, renal and bones needs further investigation as only a handful of studies were reported in Table 4. The existing generative models have been developed to converge with selection criteria that are ROI sizerestricted. The introduction of architectures that can capture the high variability of tumors is crucial. In particular, more effort should be invested into generating MRI examinations with small lesion ROIs such as low-grade gliomas, lung nodules and other similar-sized neoplasms. Oncology imaging is also characterized by the fact that the population, scanner manufacturers and acquisition methods at different sites vary a lot. Generative models fitted on a diverse set of data could achieve an improved and generalized representation of data distributions that are invariant to these differences.
However, the heterogeneity of data might not be preserved in the generated distribution and artifacts might be introduced due to the drawbacks of current cost functions and architectures. Thus, future architectures should be employed on tumor datasets along with new metrics that are better suited for robust evaluation of generative models, not just for image quality but also for assessing diversity in the generated dataset.
The computational cost and the required time for developing 3D generative models are high. This limits the majority of studies to 2D models and, therefore, key volumetric features cannot be captured by the synthetic distribution. Only two studies [46,92] synthesized a 3D MRI volume, as shown in Tables 2, 3, 4 and 5.

Conclusion
Deep generative modeling is a key technology for alleviating important limiting factors that render data collection challenging, such as the natural prevalence of several cancer types, morphological diversity of lesions and lack of standardization of MRI protocols. Although there are some trustworthiness issues in many of the presented studies, we argue that when implemented properly by strictly following the corresponding best practices and recent advances in this field, generative models have the potential to revolutionize medicine by correcting the class imbalances of the disease in the dataset, diversifying anatomically the available region of interest, providing vendor-specific samples and supporting downstream tasks with larger training sets.